AllLife Bank is a US bank that has a growing customer base. The majority of these customers are liability customers (depositors) with varying sizes of deposits. The number of customers who are also borrowers (asset customers) is quite small, and the bank is interested in expanding this base rapidly to bring in more loan business and in the process, earn more through the interest on loans. In particular, the management wants to explore ways of converting its liability customers to personal loan customers (while retaining them as depositors).
A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9%. This has encouraged the retail marketing department to devise campaigns with better-targeted marketing to increase the success ratio.
As a data scientist at AllLife Bank, you have to build a model that will help the marketing department identify the potential customers who have a higher probability of purchasing the loan.
The objective is to predict whether a liability customer will buy a personal loan, to understand which customer attributes are most significant in driving purchases, and to identify which segment of customers to target more.
- ID: Customer ID
- Age: Customer's age in completed years
- Experience: Number of years of professional experience
- Income: Annual income of the customer (in thousand dollars)
- ZIPCode: Home address ZIP code
- Family: Family size of the customer
- CCAvg: Average spending on credit cards per month (in thousand dollars)
- Education: Education level (1: Undergrad; 2: Graduate; 3: Advanced/Professional)
- Mortgage: Value of house mortgage, if any (in thousand dollars)
- Personal_Loan: Did this customer accept the personal loan offered in the last campaign? (0: No, 1: Yes)
- Securities_Account: Does the customer have a securities account with the bank? (0: No, 1: Yes)
- CD_Account: Does the customer have a certificate of deposit (CD) account with the bank? (0: No, 1: Yes)
- Online: Does the customer use internet banking facilities? (0: No, 1: Yes)
- CreditCard: Does the customer use a credit card issued by any other bank (excluding AllLife Bank)? (0: No, 1: Yes)

from google.colab import drive
drive.mount('/content/drive')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
# Installing the libraries with the specified version.
#!pip install numpy==1.25.2 pandas==1.5.3 matplotlib==3.7.1 seaborn==0.13.1 scikit-learn==1.2.2 sklearn-pandas==2.2.0 -q --user
Note:
After running the above cell, kindly restart the notebook kernel (for Jupyter Notebook) or runtime (for Google Colab), write the relevant code for the project from the next cell, and run all cells sequentially from the next cell.
On executing the above line of code, you might see a warning regarding package dependencies. This warning can be ignored, as the code above ensures that all necessary libraries and their dependencies are installed to successfully execute the code in this notebook.
# to load and manipulate data
import pandas as pd
import numpy as np
# to visualize data
import matplotlib.pyplot as plt
import seaborn as sns
# to split data into training and test sets
from sklearn.model_selection import train_test_split
# to build decision tree model
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
# to tune different models
from sklearn.model_selection import GridSearchCV
# to compute classification metrics
from sklearn.metrics import (
    confusion_matrix,
    accuracy_score,
    recall_score,
    precision_score,
    f1_score,
)
# loading data into a pandas dataframe
customer_details = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/AIMLCourse/Machine_Learning/Project02/Loan_Modelling.csv")
# Create a copy of data
data = customer_details.copy()
data.head(5)
| | ID | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 25 | 1 | 49 | 91107 | 4 | 1.6 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 2 | 45 | 19 | 34 | 90089 | 3 | 1.5 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 3 | 39 | 15 | 11 | 94720 | 1 | 1.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 4 | 35 | 9 | 100 | 94112 | 1 | 2.7 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 5 | 35 | 8 | 45 | 91330 | 4 | 1.0 | 2 | 0 | 0 | 0 | 0 | 0 | 1 |
data.tail(5)
| | ID | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4995 | 4996 | 29 | 3 | 40 | 92697 | 1 | 1.9 | 3 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4996 | 4997 | 30 | 4 | 15 | 92037 | 4 | 0.4 | 1 | 85 | 0 | 0 | 0 | 1 | 0 |
| 4997 | 4998 | 63 | 39 | 24 | 93023 | 2 | 0.3 | 3 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4998 | 4999 | 65 | 40 | 49 | 90034 | 3 | 0.5 | 2 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4999 | 5000 | 28 | 4 | 83 | 92612 | 3 | 0.8 | 1 | 0 | 0 | 0 | 0 | 1 | 1 |
data.shape
(5000, 14)
The dataset has 5000 rows and 14 columns.
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 14 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   ID                  5000 non-null   int64
 1   Age                 5000 non-null   int64
 2   Experience          5000 non-null   int64
 3   Income              5000 non-null   int64
 4   ZIPCode             5000 non-null   int64
 5   Family              5000 non-null   int64
 6   CCAvg               5000 non-null   float64
 7   Education           5000 non-null   int64
 8   Mortgage            5000 non-null   int64
 9   Personal_Loan       5000 non-null   int64
 10  Securities_Account  5000 non-null   int64
 11  CD_Account          5000 non-null   int64
 12  Online              5000 non-null   int64
 13  CreditCard          5000 non-null   int64
dtypes: float64(1), int64(13)
memory usage: 547.0 KB
# The ID column is just an identifier with no predictive value, so it can be dropped
data.drop('ID', axis=1, inplace=True)
data.describe(include='all').T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Age | 5000.0 | 45.338400 | 11.463166 | 23.0 | 35.0 | 45.0 | 55.0 | 67.0 |
| Experience | 5000.0 | 20.104600 | 11.467954 | -3.0 | 10.0 | 20.0 | 30.0 | 43.0 |
| Income | 5000.0 | 73.774200 | 46.033729 | 8.0 | 39.0 | 64.0 | 98.0 | 224.0 |
| ZIPCode | 5000.0 | 93169.257000 | 1759.455086 | 90005.0 | 91911.0 | 93437.0 | 94608.0 | 96651.0 |
| Family | 5000.0 | 2.396400 | 1.147663 | 1.0 | 1.0 | 2.0 | 3.0 | 4.0 |
| CCAvg | 5000.0 | 1.937938 | 1.747659 | 0.0 | 0.7 | 1.5 | 2.5 | 10.0 |
| Education | 5000.0 | 1.881000 | 0.839869 | 1.0 | 1.0 | 2.0 | 3.0 | 3.0 |
| Mortgage | 5000.0 | 56.498800 | 101.713802 | 0.0 | 0.0 | 0.0 | 101.0 | 635.0 |
| Personal_Loan | 5000.0 | 0.096000 | 0.294621 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| Securities_Account | 5000.0 | 0.104400 | 0.305809 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| CD_Account | 5000.0 | 0.060400 | 0.238250 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| Online | 5000.0 | 0.596800 | 0.490589 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| CreditCard | 5000.0 | 0.294000 | 0.455637 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
The minimum value of Experience is negative, which is not valid. Let's fix the data by replacing the negative values with their positive counterparts.
data['Experience'].unique()
array([ 1, 19, 15, 9, 8, 13, 27, 24, 10, 39, 5, 23, 32, 41, 30, 14, 18,
21, 28, 31, 11, 16, 20, 35, 6, 25, 7, 12, 26, 37, 17, 2, 36, 29,
3, 22, -1, 34, 0, 38, 40, 33, 4, -2, 42, -3, 43])
# Checking -ve experience value
data[data['Experience'] < 0]['Experience'].unique()
array([-1, -2, -3])
# Correcting the data by replacing negative values with their positive counterparts
data['Experience'] = data['Experience'].replace({-1: 1, -2: 2, -3: 3})
data['Experience'].unique()
array([ 1, 19, 15, 9, 8, 13, 27, 24, 10, 39, 5, 23, 32, 41, 30, 14, 18,
21, 28, 31, 11, 16, 20, 35, 6, 25, 7, 12, 26, 37, 17, 2, 36, 29,
3, 22, 34, 0, 38, 40, 33, 4, 42, 43])
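The negative values were corrected above by mapping each one to its positive counterpart. Since every negative Experience value is just a sign error, the same fix can also be written as a single vectorized step — a minimal sketch on a toy frame (the values below are illustrative, not the dataset's):

```python
import pandas as pd

# toy frame standing in for the real data (illustrative values only)
df = pd.DataFrame({"Experience": [1, -1, -2, 19, -3, 0]})

# take the absolute value so -1, -2, -3 become 1, 2, 3 in one step
df["Experience"] = df["Experience"].abs()

print(sorted(df["Experience"].unique()))  # no negative values remain
```

This avoids listing each negative value explicitly and still works if new negative values appear in future data.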
data.isnull().sum()
| | 0 |
|---|---|
| Age | 0 |
| Experience | 0 |
| Income | 0 |
| ZIPCode | 0 |
| Family | 0 |
| CCAvg | 0 |
| Education | 0 |
| Mortgage | 0 |
| Personal_Loan | 0 |
| Securities_Account | 0 |
| CD_Account | 0 |
| Online | 0 |
| CreditCard | 0 |
data.duplicated().sum()
0
data['Securities_Account'].value_counts(normalize=True)
| | proportion |
|---|---|
| Securities_Account | |
| 0 | 0.8956 |
| 1 | 0.1044 |
data['CD_Account'].value_counts(normalize=True)
| | proportion |
|---|---|
| CD_Account | |
| 0 | 0.9396 |
| 1 | 0.0604 |
data['Online'].value_counts(normalize=True)
| | proportion |
|---|---|
| Online | |
| 1 | 0.5968 |
| 0 | 0.4032 |
data['CreditCard'].value_counts(normalize=True)
| | proportion |
|---|---|
| CreditCard | |
| 0 | 0.706 |
| 1 | 0.294 |
# Checking number of unique values in zipcode
data['ZIPCode'].unique()
array([91107, 90089, 94720, 94112, 91330, 92121, 91711, 93943, 93023,
94710, 90277, 93106, 94920, 91741, 95054, 95010, 94305, 91604,
94015, 90095, 91320, 95521, 95064, 90064, 94539, 94104, 94117,
94801, 94035, 92647, 95814, 94114, 94115, 92672, 94122, 90019,
95616, 94065, 95014, 91380, 95747, 92373, 92093, 94005, 90245,
95819, 94022, 90404, 93407, 94523, 90024, 91360, 95670, 95123,
90045, 91335, 93907, 92007, 94606, 94611, 94901, 92220, 93305,
95134, 94612, 92507, 91730, 94501, 94303, 94105, 94550, 92612,
95617, 92374, 94080, 94608, 93555, 93311, 94704, 92717, 92037,
95136, 94542, 94143, 91775, 92703, 92354, 92024, 92831, 92833,
94304, 90057, 92130, 91301, 92096, 92646, 92182, 92131, 93720,
90840, 95035, 93010, 94928, 95831, 91770, 90007, 94102, 91423,
93955, 94107, 92834, 93117, 94551, 94596, 94025, 94545, 95053,
90036, 91125, 95120, 94706, 95827, 90503, 90250, 95817, 95503,
93111, 94132, 95818, 91942, 90401, 93524, 95133, 92173, 94043,
92521, 92122, 93118, 92697, 94577, 91345, 94123, 92152, 91355,
94609, 94306, 96150, 94110, 94707, 91326, 90291, 92807, 95051,
94085, 92677, 92614, 92626, 94583, 92103, 92691, 92407, 90504,
94002, 95039, 94063, 94923, 95023, 90058, 92126, 94118, 90029,
92806, 94806, 92110, 94536, 90623, 92069, 92843, 92120, 95605,
90740, 91207, 95929, 93437, 90630, 90034, 90266, 95630, 93657,
92038, 91304, 92606, 92192, 90745, 95060, 94301, 92692, 92101,
94610, 90254, 94590, 92028, 92054, 92029, 93105, 91941, 92346,
94402, 94618, 94904, 93077, 95482, 91709, 91311, 94509, 92866,
91745, 94111, 94309, 90073, 92333, 90505, 94998, 94086, 94709,
95825, 90509, 93108, 94588, 91706, 92109, 92068, 95841, 92123,
91342, 90232, 92634, 91006, 91768, 90028, 92008, 95112, 92154,
92115, 92177, 90640, 94607, 92780, 90009, 92518, 91007, 93014,
94024, 90027, 95207, 90717, 94534, 94010, 91614, 94234, 90210,
95020, 92870, 92124, 90049, 94521, 95678, 95045, 92653, 92821,
90025, 92835, 91910, 94701, 91129, 90071, 96651, 94960, 91902,
90033, 95621, 90037, 90005, 93940, 91109, 93009, 93561, 95126,
94109, 93107, 94591, 92251, 92648, 92709, 91754, 92009, 96064,
91103, 91030, 90066, 95403, 91016, 95348, 91950, 95822, 94538,
92056, 93063, 91040, 92661, 94061, 95758, 96091, 94066, 94939,
95138, 95762, 92064, 94708, 92106, 92116, 91302, 90048, 90405,
92325, 91116, 92868, 90638, 90747, 93611, 95833, 91605, 92675,
90650, 95820, 90018, 93711, 95973, 92886, 95812, 91203, 91105,
95008, 90016, 90035, 92129, 90720, 94949, 90041, 95003, 95192,
91101, 94126, 90230, 93101, 91365, 91367, 91763, 92660, 92104,
91361, 90011, 90032, 95354, 94546, 92673, 95741, 95351, 92399,
90274, 94087, 90044, 94131, 94124, 95032, 90212, 93109, 94019,
95828, 90086, 94555, 93033, 93022, 91343, 91911, 94803, 94553,
95211, 90304, 92084, 90601, 92704, 92350, 94705, 93401, 90502,
94571, 95070, 92735, 95037, 95135, 94028, 96003, 91024, 90065,
95405, 95370, 93727, 92867, 95821, 94566, 95125, 94526, 94604,
96008, 93065, 96001, 95006, 90639, 92630, 95307, 91801, 94302,
91710, 93950, 90059, 94108, 94558, 93933, 92161, 94507, 94575,
95449, 93403, 93460, 95005, 93302, 94040, 91401, 95816, 92624,
95131, 94965, 91784, 91765, 90280, 95422, 95518, 95193, 92694,
90275, 90272, 91791, 92705, 91773, 93003, 90755, 96145, 94703,
96094, 95842, 94116, 90068, 94970, 90813, 94404, 94598])
# Check the number of unique values using the first two digits of ZIPCode
data['ZIPCode'] = data['ZIPCode'].astype(str)
print("Number of unique values if we take the first two digits of ZIPCode: ", data['ZIPCode'].str[0:2].nunique())
Number of unique values if we take the first two digits of ZIPCode:  7
# Keep only the first two digits of ZIPCode for better analysis
data['ZIPCode'] = data['ZIPCode'].str[0:2]
data['ZIPCode'] = data['ZIPCode'].astype('category')
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 13 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   Age                 5000 non-null   int64
 1   Experience          5000 non-null   int64
 2   Income              5000 non-null   int64
 3   ZIPCode             5000 non-null   category
 4   Family              5000 non-null   int64
 5   CCAvg               5000 non-null   float64
 6   Education           5000 non-null   int64
 7   Mortgage            5000 non-null   int64
 8   Personal_Loan       5000 non-null   int64
 9   Securities_Account  5000 non-null   int64
 10  CD_Account          5000 non-null   int64
 11  Online              5000 non-null   int64
 12  CreditCard          5000 non-null   int64
dtypes: category(1), float64(1), int64(11)
memory usage: 474.1 KB
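Since every ZIP code in this dataset is five digits, the two-digit prefix could equally be obtained with integer division instead of string slicing — a small sketch on a few sample codes from the head of the data:

```python
import pandas as pd

# a few sample five-digit ZIP codes (from the head of the dataset)
zips = pd.Series([91107, 90089, 94720, 94112, 91330])

# integer-divide by 1000 to keep the first two digits of a five-digit ZIP
prefix = (zips // 1000).astype(str)

print(prefix.tolist())  # ['91', '90', '94', '94', '91']
```

Both approaches give the same prefixes; the string version used above is more explicit about treating ZIP codes as labels rather than numbers.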
data.head()
| | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25 | 1 | 49 | 91 | 4 | 1.6 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 45 | 19 | 34 | 90 | 3 | 1.5 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 39 | 15 | 11 | 94 | 1 | 1.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 35 | 9 | 100 | 94 | 1 | 2.7 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 35 | 8 | 45 | 91 | 4 | 1.0 | 2 | 0 | 0 | 0 | 0 | 0 | 1 |
# defining the figure size
plt.figure(figsize=(15,10))
# defining the list of numerical feature to plot
num_features = ['Age','Experience','Income','CCAvg','Mortgage']
# plotting the histogram for each numerical feature
for i, feature in enumerate(num_features):
    plt.subplot(3, 3, i + 1)
    sns.histplot(data=data, x=feature, kde=True)
    plt.axvline(data[feature].mean(), color='green', linestyle="--")
    plt.axvline(data[feature].median(), color="black", linestyle="-")
plt.tight_layout();
# defining the figure size
plt.figure(figsize=(15, 10))
# plotting the boxplot for each numerical feature
for i, feature in enumerate(num_features):
    plt.subplot(3, 3, i + 1)  # assign a subplot in the main plot
    sns.boxplot(data=data, x=feature)  # plot the boxplot
plt.tight_layout();  # to add spacing between plots
Age: The boxplot shows no apparent outliers, with the median age around 45. The interquartile range (IQR) spans approximately 35 to 55 years.
Experience: Similar to age, the experience distribution also has no outliers, with the median around 20 years and a wide IQR from 10 to 30 years.
Income: There are several outliers at the higher end (above roughly 180k), with the median income slightly below 70k.
CCAvg: This boxplot shows a large number of outliers at the higher end, above 5k. The median spend is around 1.5k, and 75% of the data points fall below 2.5k.
Mortgage: There are numerous outliers, with many values exceeding 250k. The majority of mortgages are zero, so the median is 0.
# function to create labeled barplots
def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with percentage at the top
    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """
    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 2, 6))
    else:
        plt.figure(figsize=(n + 2, 6))
    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        hue=feature,
        legend=False,
        palette="Paired",
        order=data[feature].value_counts().index[:n],
    )
    for p in ax.patches:
        if perc:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category
        x = p.get_x() + p.get_width() / 2  # x-position of the bar center
        y = p.get_height()  # height of the bar
        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the count/percentage above the bar
    plt.show()  # show the plot
def stacked_barplot(data, predictor, target):
    """
    Print the category counts and plot a stacked bar chart
    data: dataframe
    predictor: independent variable
    target: target variable
    """
    count = data[predictor].nunique()
    sorter = data[target].value_counts().index[-1]
    tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
        by=sorter, ascending=False
    )
    print(tab1)
    print("-" * 120)
    tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
        by=sorter, ascending=False
    )
    tab.plot(kind="bar", stacked=True, figsize=(count + 5, 5))
    # a second plt.legend call would discard the first, so configure the legend once
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1), frameon=False)
    plt.show()
### function to plot distributions wrt target
def distribution_plot_wrt_target(data, predictor, target):
    fig, axs = plt.subplots(2, 2, figsize=(12, 10))
    target_uniq = data[target].unique()
    axs[0, 0].set_title("Distribution of target for target=" + str(target_uniq[0]))
    sns.histplot(
        data=data[data[target] == target_uniq[0]],
        x=predictor,
        kde=True,
        ax=axs[0, 0],
        color="teal",
        stat="density",
    )
    axs[0, 1].set_title("Distribution of target for target=" + str(target_uniq[1]))
    sns.histplot(
        data=data[data[target] == target_uniq[1]],
        x=predictor,
        kde=True,
        ax=axs[0, 1],
        color="orange",
        stat="density",
    )
    axs[1, 0].set_title("Boxplot w.r.t target")
    sns.boxplot(data=data, x=target, y=predictor, ax=axs[1, 0])
    axs[1, 1].set_title("Boxplot (without outliers) w.r.t target")
    sns.boxplot(
        data=data,
        x=target,
        y=predictor,
        ax=axs[1, 1],
        showfliers=False,
    )
    plt.tight_layout()
    plt.show()
labeled_barplot(data, 'Family', perc=True)
labeled_barplot(data, 'Education', perc=True)
labeled_barplot(data, 'Securities_Account', perc=True)
labeled_barplot(data, 'CD_Account', perc=True)
labeled_barplot(data, 'Online', perc=True)
labeled_barplot(data, 'CreditCard', perc=True)
labeled_barplot(data, 'ZIPCode', perc=True, n=20)
# data['Personal_Loan'].value_counts(normalize=True)
loan_stats = pd.DataFrame(data['Personal_Loan'].value_counts(normalize=True)).reset_index()
loan_stats.columns = ["Labels", "Personal Loan"]
plt.pie(loan_stats['Personal Loan'], labels=loan_stats['Labels'], autopct='%.0f%%');
sns.pairplot(data)
<seaborn.axisgrid.PairGrid at 0x7e7a5747a6b0>
cols_list = data.select_dtypes(include=['number']).columns.tolist()
plt.figure(figsize=(12, 7))
sns.heatmap(
data[cols_list].corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral"
)
plt.show()
distribution_plot_wrt_target(data, 'Age', 'Personal_Loan')
distribution_plot_wrt_target(data, 'CD_Account', 'Personal_Loan')
distribution_plot_wrt_target(data, 'Experience', 'Personal_Loan')
distribution_plot_wrt_target(data, 'Education', 'Personal_Loan')
data.groupby(['Education'])['Personal_Loan'].value_counts(normalize=True)
| | | proportion |
|---|---|---|
| Education | Personal_Loan | |
| 1 | 0 | 0.955630 |
| 1 | 0.044370 | |
| 2 | 0 | 0.870278 |
| 1 | 0.129722 | |
| 3 | 0 | 0.863424 |
| 1 | 0.136576 |
distribution_plot_wrt_target(data, 'Income', 'Personal_Loan')
distribution_plot_wrt_target(data, 'CCAvg', 'Personal_Loan')
distribution_plot_wrt_target(data, 'ZIPCode', 'Personal_Loan')
distribution_plot_wrt_target(data, 'Mortgage', 'Personal_Loan')
# outlier detection using boxplot
plt.figure(figsize=(15, 12))
for i, variable in enumerate(num_features):
    plt.subplot(4, 4, i + 1)
    plt.boxplot(data[variable], whis=1.5)
    plt.tight_layout()
    plt.title(variable)
plt.show()
# Find 25th percentile and 75th percentile
Q1 = data[num_features].quantile(0.25)
Q3 = data[num_features].quantile(0.75)
# Calculate the interquartile range (IQR)
IQR = Q3 - Q1
# Find the lower and upper outlier bounds
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
# Calculate outliers for each column (using element-wise comparison) for numeric columns
outliers = (data[num_features] < lower) | (data[num_features] > upper)
# Calculate the percentage of outliers in each column
outliers_percentage = (outliers.sum() / len(data)) * 100
# Display the percentage of outliers for each column
print(outliers_percentage)
Age           0.00
Experience    0.00
Income        1.92
CCAvg         6.48
Mortgage      5.82
dtype: float64
outliers.sum()
| | 0 |
|---|---|
| Age | 0 |
| Experience | 0 |
| Income | 96 |
| CCAvg | 324 |
| Mortgage | 291 |
data[data['CCAvg'] > upper['CCAvg']]
| | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 9 | 34 | 9 | 180 | 93 | 1 | 8.90 | 3 | 0 | 1 | 0 | 0 | 0 | 0 |
| 18 | 46 | 21 | 193 | 91 | 2 | 8.10 | 3 | 0 | 1 | 0 | 0 | 0 | 0 |
| 44 | 46 | 20 | 104 | 94 | 1 | 5.70 | 1 | 0 | 0 | 0 | 0 | 1 | 1 |
| 55 | 41 | 17 | 139 | 94 | 2 | 8.00 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 61 | 47 | 21 | 125 | 93 | 1 | 5.70 | 1 | 112 | 0 | 1 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4908 | 40 | 16 | 138 | 92 | 2 | 6.10 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4911 | 46 | 22 | 153 | 94 | 2 | 7.50 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4937 | 33 | 8 | 162 | 94 | 1 | 8.60 | 1 | 0 | 0 | 0 | 1 | 1 | 1 |
| 4980 | 29 | 5 | 135 | 95 | 3 | 5.30 | 1 | 0 | 1 | 0 | 1 | 1 | 1 |
| 4993 | 45 | 21 | 218 | 91 | 2 | 6.67 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
324 rows × 13 columns
# Dropping Personal_Loan (the target variable) and Experience (almost perfectly correlated with Age) from the feature set
X = data.drop(['Personal_Loan', 'Experience'], axis=1)
Y = data['Personal_Loan']
# Get dummy columns for categorical features
X = pd.get_dummies(X, columns=["ZIPCode", "Education"], drop_first=True)
X = X.astype(float)
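As a quick illustration of what `drop_first=True` does, here is a toy frame (not the project data): with k levels, it keeps k-1 indicator columns and uses the dropped level as the implicit baseline, which avoids perfectly collinear dummies.

```python
import pandas as pd

# toy frame with the three Education levels (illustrative only)
toy = pd.DataFrame({"Education": [1, 2, 3, 1]})
dummies = pd.get_dummies(toy, columns=["Education"], drop_first=True)

# level 1 becomes the baseline; only Education_2 and Education_3 remain
print(list(dummies.columns))  # ['Education_2', 'Education_3']
```

A row with both indicators equal to 0 is therefore an Undergrad (level 1) customer.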
# Splitting data in train and test sets
X_train, X_test, y_train, y_test = train_test_split(
X, Y, test_size=0.30, random_state=1
)
print("Shape of Training set : ", X_train.shape)
print("Shape of test set : ", X_test.shape)
print("Percentage of classes in training set:")
print(y_train.value_counts(normalize=True))
print("Percentage of classes in test set:")
print(y_test.value_counts(normalize=True))
Shape of Training set :  (3500, 17)
Shape of test set :  (1500, 17)
Percentage of classes in training set:
Personal_Loan
0    0.905429
1    0.094571
Name: proportion, dtype: float64
Percentage of classes in test set:
Personal_Loan
0    0.900667
1    0.099333
Name: proportion, dtype: float64
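The class proportions above happen to be close across the two sets, but with only ~9.6% positives a plain random split does not guarantee that; passing `stratify` to `train_test_split` preserves the ratio by construction. A sketch on synthetic labels (the array names below are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# synthetic features and imbalanced labels (~10% positives, mirroring Personal_Loan)
rng = np.random.RandomState(1)
X_fake = rng.rand(1000, 3)
y_fake = (rng.rand(1000) < 0.1).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_fake, y_fake, test_size=0.30, random_state=1, stratify=y_fake
)

# stratification keeps the positive rate nearly identical in both splits
print(round(y_tr.mean(), 3), round(y_te.mean(), 3))
```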
The model can make two kinds of wrong predictions:

1. Predicting that a customer will purchase the loan when they will not (false positive).
2. Predicting that a customer will not purchase the loan when they actually would (false negative).

Which case is more important? A false negative loses a potential loan customer, and with them the interest income the bank wants to grow, whereas a false positive only wastes some marketing effort on an uninterested customer. So false negatives are the costlier error here.

How to reduce the losses? Maximize Recall, since a higher Recall means fewer false negatives.
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
    """
    Function to compute different metrics to check classification model performance
    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    # predicting using the independent variables
    pred = model.predict(predictors)
    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score
    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1},
        index=[0],
    )
    return df_perf
def confusion_matrix_sklearn(model, predictors, target):
    """
    To plot the confusion_matrix with percentages
    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    y_pred = model.predict(predictors)
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)
    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
model = DecisionTreeClassifier(random_state=1)
model.fit(X_train, y_train)
DecisionTreeClassifier(random_state=1)
# Confusion matrix
confusion_matrix_sklearn(model, X_train, y_train)
# compute different metrics to check performance on training set
default_decision_tree_perf_train = model_performance_classification_sklearn(model, X_train, y_train)
default_decision_tree_perf_train
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 1.0 | 1.0 | 1.0 | 1.0 |
Checking model performance on testing data
# compute different metrics to check performance
default_decision_tree_perf_test = model_performance_classification_sklearn(model, X_test, y_test)
default_decision_tree_perf_test
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.986 | 0.932886 | 0.926667 | 0.929766 |
# Confusion matrix
confusion_matrix_sklearn(model, X_test, y_test)
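The training metrics above are all 1.0 while the test recall is about 0.93 — the classic sign that the unrestricted tree has memorized the training set. `DecisionTreeClassifier` exposes pre-pruning knobs such as `max_depth` to limit this. A minimal sketch on synthetic data (the variable names and dataset are made up; only the effect of `max_depth` matters here):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in data with a similar class imbalance
Xs, ys = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=1)
Xtr, Xte, ytr, yte = train_test_split(Xs, ys, random_state=1)

full = DecisionTreeClassifier(random_state=1).fit(Xtr, ytr)
pruned = DecisionTreeClassifier(max_depth=4, random_state=1).fit(Xtr, ytr)

# an unrestricted tree memorizes the training set; a depth-limited one cannot
print(full.score(Xtr, ytr), pruned.score(Xtr, ytr))
```

This is the motivation for the hyperparameter tuning carried out later in the notebook.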
# list of feature names in X_train
feature_names = list(X_train.columns)
# set the figure size for the plot
plt.figure(figsize=(20, 20))
# plotting the decision tree
out = tree.plot_tree(
    model,  # decision tree classifier model
    feature_names=feature_names,  # list of feature names (columns) in the dataset
    filled=True,  # fill the nodes with colors based on class
    fontsize=9,  # font size for the node text
    node_ids=True,  # show the ID of each node
    class_names=None,  # do not display class names
)
# add arrows to the decision tree splits if they are missing
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")  # set arrow color to black
        arrow.set_linewidth(1)  # set arrow linewidth to 1
# displaying the plot
plt.show()
# printing a text report showing the rules of a decision tree
print(
    tree.export_text(
        model,  # specify the model
        feature_names=feature_names,  # specify the feature names
        show_weights=True,  # whether or not to show the sample weights at each node
    )
)
|--- Income <= 116.50 | |--- CCAvg <= 2.95 | | |--- Income <= 106.50 | | | |--- weights: [2553.00, 0.00] class: 0 | | |--- Income > 106.50 | | | |--- Family <= 3.50 | | | | |--- ZIPCode_93 <= 0.50 | | | | | |--- Age <= 28.50 | | | | | | |--- Education_2 <= 0.50 | | | | | | | |--- weights: [5.00, 0.00] class: 0 | | | | | | |--- Education_2 > 0.50 | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | |--- Age > 28.50 | | | | | | |--- CCAvg <= 2.20 | | | | | | | |--- weights: [48.00, 0.00] class: 0 | | | | | | |--- CCAvg > 2.20 | | | | | | | |--- Education_3 <= 0.50 | | | | | | | | |--- weights: [7.00, 0.00] class: 0 | | | | | | | |--- Education_3 > 0.50 | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | |--- ZIPCode_93 > 0.50 | | | | | |--- Age <= 37.50 | | | | | | |--- weights: [2.00, 0.00] class: 0 | | | | | |--- Age > 37.50 | | | | | | |--- Income <= 112.00 | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | |--- Income > 112.00 | | | | | | | |--- weights: [1.00, 0.00] class: 0 | | | |--- Family > 3.50 | | | | |--- Age <= 32.50 | | | | | |--- CCAvg <= 2.40 | | | | | | |--- weights: [12.00, 0.00] class: 0 | | | | | |--- CCAvg > 2.40 | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | |--- Age > 32.50 | | | | | |--- Age <= 60.00 | | | | | | |--- weights: [0.00, 6.00] class: 1 | | | | | |--- Age > 60.00 | | | | | | |--- weights: [4.00, 0.00] class: 0 | |--- CCAvg > 2.95 | | |--- Income <= 92.50 | | | |--- CD_Account <= 0.50 | | | | |--- Age <= 26.50 | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | |--- Age > 26.50 | | | | | |--- CCAvg <= 3.55 | | | | | | |--- CCAvg <= 3.35 | | | | | | | |--- Age <= 37.50 | | | | | | | | |--- Age <= 33.50 | | | | | | | | | |--- weights: [3.00, 0.00] class: 0 | | | | | | | | |--- Age > 33.50 | | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | | |--- Age > 37.50 | | | | | | | | |--- Income <= 82.50 | | | | | | | | | |--- weights: [23.00, 0.00] class: 0 | | | | | | | | |--- Income > 82.50 | 
| | | | | | | | |--- Income <= 83.50 | | | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | | | | |--- Income > 83.50 | | | | | | | | | | |--- weights: [5.00, 0.00] class: 0 | | | | | | |--- CCAvg > 3.35 | | | | | | | |--- Family <= 3.00 | | | | | | | | |--- weights: [0.00, 5.00] class: 1 | | | | | | | |--- Family > 3.00 | | | | | | | | |--- weights: [9.00, 0.00] class: 0 | | | | | |--- CCAvg > 3.55 | | | | | | |--- Income <= 81.50 | | | | | | | |--- weights: [43.00, 0.00] class: 0 | | | | | | |--- Income > 81.50 | | | | | | | |--- Education_2 <= 0.50 | | | | | | | | |--- Mortgage <= 93.50 | | | | | | | | | |--- weights: [26.00, 0.00] class: 0 | | | | | | | | |--- Mortgage > 93.50 | | | | | | | | | |--- Mortgage <= 104.50 | | | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | | | | |--- Mortgage > 104.50 | | | | | | | | | | |--- weights: [6.00, 0.00] class: 0 | | | | | | | |--- Education_2 > 0.50 | | | | | | | | |--- ZIPCode_91 <= 0.50 | | | | | | | | | |--- Family <= 3.50 | | | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | | | | |--- Family > 3.50 | | | | | | | | | | |--- weights: [1.00, 0.00] class: 0 | | | | | | | | |--- ZIPCode_91 > 0.50 | | | | | | | | | |--- weights: [1.00, 0.00] class: 0 | | | |--- CD_Account > 0.50 | | | | |--- weights: [0.00, 5.00] class: 1 | | |--- Income > 92.50 | | | |--- Family <= 2.50 | | | | |--- Education_2 <= 0.50 | | | | | |--- Education_3 <= 0.50 | | | | | | |--- CD_Account <= 0.50 | | | | | | | |--- Age <= 56.50 | | | | | | | | |--- weights: [27.00, 0.00] class: 0 | | | | | | | |--- Age > 56.50 | | | | | | | | |--- Online <= 0.50 | | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | | | |--- Online > 0.50 | | | | | | | | | |--- weights: [2.00, 0.00] class: 0 | | | | | | |--- CD_Account > 0.50 | | | | | | | |--- Securities_Account <= 0.50 | | | | | | | | |--- weights: [1.00, 0.00] class: 0 | | | | | | | |--- Securities_Account > 0.50 | | | | | | | | |--- weights: [0.00, 2.00] 
class: 1 | | | | | |--- Education_3 > 0.50 | | | | | | |--- ZIPCode_94 <= 0.50 | | | | | | | |--- Income <= 107.00 | | | | | | | | |--- weights: [7.00, 0.00] class: 0 | | | | | | | |--- Income > 107.00 | | | | | | | | |--- weights: [0.00, 2.00] class: 1 | | | | | | |--- ZIPCode_94 > 0.50 | | | | | | | |--- weights: [0.00, 5.00] class: 1 | | | | |--- Education_2 > 0.50 | | | | | |--- weights: [0.00, 4.00] class: 1 | | | |--- Family > 2.50 | | | | |--- Age <= 57.50 | | | | | |--- CCAvg <= 4.85 | | | | | | |--- weights: [0.00, 17.00] class: 1 | | | | | |--- CCAvg > 4.85 | | | | | | |--- CCAvg <= 4.95 | | | | | | | |--- weights: [1.00, 0.00] class: 0 | | | | | | |--- CCAvg > 4.95 | | | | | | | |--- weights: [0.00, 3.00] class: 1 | | | | |--- Age > 57.50 | | | | | |--- ZIPCode_93 <= 0.50 | | | | | | |--- ZIPCode_94 <= 0.50 | | | | | | | |--- weights: [5.00, 0.00] class: 0 | | | | | | |--- ZIPCode_94 > 0.50 | | | | | | | |--- Age <= 59.50 | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | | |--- Age > 59.50 | | | | | | | | |--- weights: [2.00, 0.00] class: 0 | | | | | |--- ZIPCode_93 > 0.50 | | | | | | |--- weights: [0.00, 2.00] class: 1 |--- Income > 116.50 | |--- Family <= 2.50 | | |--- Education_3 <= 0.50 | | | |--- Education_2 <= 0.50 | | | | |--- weights: [375.00, 0.00] class: 0 | | | |--- Education_2 > 0.50 | | | | |--- weights: [0.00, 53.00] class: 1 | | |--- Education_3 > 0.50 | | | |--- weights: [0.00, 62.00] class: 1 | |--- Family > 2.50 | | |--- weights: [0.00, 154.00] class: 1
Since class 1 makes up only about 10% of the data and class 0 about 90%, class 0 dominates and an unweighted decision tree becomes biased toward it.
To counter this, we will set class_weight = "balanced", which automatically adjusts the weights to be inversely proportional to the class frequencies in the input data.
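Concretely, scikit-learn computes the "balanced" weights as n_samples / (n_classes * np.bincount(y)), so with a 90/10 split the minority class is weighted about 9x higher. A small sketch with synthetic labels matching that ratio:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# labels with the same 90/10 imbalance as Personal_Loan
y = np.array([0] * 90 + [1] * 10)

# "balanced" uses n_samples / (n_classes * np.bincount(y))
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)
print(weights)  # roughly [0.556, 5.0] -> the minority class is weighted ~9x higher
```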
model1 = DecisionTreeClassifier(random_state=1, class_weight="balanced")
model1.fit(X_train, y_train)
DecisionTreeClassifier(class_weight='balanced', random_state=1)
confusion_matrix_sklearn(model1, X_train, y_train)
model1_perf_train = model_performance_classification_sklearn(
    model1, X_train, y_train
)
model1_perf_train
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 1.0 | 1.0 | 1.0 | 1.0 |
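The helpers `confusion_matrix_sklearn` and `model_performance_classification_sklearn` are defined earlier in the notebook; a minimal sketch of what such a performance helper likely computes is below (the name `performance_summary` and the exact layout are our assumptions, not the notebook's actual definition):

```python
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score


def performance_summary(model, X, y):
    """Return accuracy, recall, precision, and F1 as a one-row DataFrame."""
    pred = model.predict(X)
    return pd.DataFrame(
        {
            "Accuracy": accuracy_score(y, pred),
            "Recall": recall_score(y, pred),
            "Precision": precision_score(y, pred),
            "F1": f1_score(y, pred),
        },
        index=[0],
    )
```

Returning a one-row DataFrame (rather than printing) is what makes the later `pd.concat([...], axis=1)` model-comparison tables possible.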
confusion_matrix_sklearn(model1, X_test, y_test)
model1_perf_test = model_performance_classification_sklearn(
    model1, X_test, y_test
)
model1_perf_test
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.978667 | 0.865772 | 0.914894 | 0.889655 |
# Define the parameters of the tree to iterate over
max_depth_values = np.arange(1, 11, 2)
max_leaf_nodes_values = [50, 75, 150, 250]
min_samples_split_values = [10, 30, 50, 70]
# Initialize variables to store the best model and its performance
best_estimator = None
best_score_diff = float('inf')
best_test_score = 0.0
# Iterate over all combinations of the specified parameter values
for max_depth in max_depth_values:
    for max_leaf_nodes in max_leaf_nodes_values:
        for min_samples_split in min_samples_split_values:
            # Initialize the tree with the current set of parameters
            estimator = DecisionTreeClassifier(
                max_depth=max_depth,
                class_weight="balanced",
                max_leaf_nodes=max_leaf_nodes,
                min_samples_split=min_samples_split,
                random_state=1,
            )
            # Fit the model to the training data
            estimator.fit(X_train, y_train)
            # Make predictions on the training and test sets
            y_train_pred = estimator.predict(X_train)
            y_test_pred = estimator.predict(X_test)
            # Calculate recall scores for training and test sets
            train_recall_score = recall_score(y_train, y_train_pred)
            test_recall_score = recall_score(y_test, y_test_pred)
            # Calculate the absolute difference between training and test recall
            score_diff = abs(train_recall_score - test_recall_score)
            # Keep the model with a smaller train-test gap AND a higher test recall
            if score_diff < best_score_diff and test_recall_score > best_test_score:
                best_score_diff = score_diff
                best_test_score = test_recall_score
                best_estimator = estimator
# Print the best parameters
print("Best parameters found:")
print(f"Max depth: {best_estimator.max_depth}")
print(f"Max leaf nodes: {best_estimator.max_leaf_nodes}")
print(f"Min samples split: {best_estimator.min_samples_split}")
print(f"Best test recall score: {best_test_score}")
Best parameters found:
Max depth: 3
Max leaf nodes: 50
Min samples split: 10
Best test recall score: 0.9530201342281879
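The manual triple loop above can also be expressed with scikit-learn's `GridSearchCV`. The sketch below mirrors the same parameter grid; note it optimizes cross-validated recall, which is our stand-in for the train/test-gap criterion used in the loop, not an exact equivalent:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Same grid as the manual loop above
param_grid = {
    "max_depth": np.arange(1, 11, 2),
    "max_leaf_nodes": [50, 75, 150, 250],
    "min_samples_split": [10, 30, 50, 70],
}

grid = GridSearchCV(
    DecisionTreeClassifier(random_state=1, class_weight="balanced"),
    param_grid,
    scoring="recall",  # optimize recall, as the campaign cares about catching buyers
    cv=5,
)
# grid.fit(X_train, y_train); grid.best_params_, grid.best_score_
```

Cross-validation has the advantage of never touching the test set during tuning, whereas the manual loop selects on test recall and can leak test-set information into the model choice.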
model2 = best_estimator
model2.fit(X_train, y_train)
DecisionTreeClassifier(class_weight='balanced', max_depth=3, max_leaf_nodes=50,
                       min_samples_split=10, random_state=1)
confusion_matrix_sklearn(model2, X_train, y_train)
model2_pre_perf_train = model_performance_classification_sklearn(model2, X_train, y_train)
model2_pre_perf_train
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.820857 | 0.969789 | 0.342217 | 0.50591 |
confusion_matrix_sklearn(model2, X_test, y_test)
model2_pre_perf_test = model_performance_classification_sklearn(model2, X_test, y_test)
model2_pre_perf_test
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.801333 | 0.95302 | 0.327945 | 0.487973 |
feature_names = list(X_train.columns)
importances = model2.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(20, 10))
out = tree.plot_tree(
    model2,
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=True,
    class_names=None,
)
# the code below adds arrows to the decision tree splits if they are missing
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()
# Text report showing the rules of a decision tree -
print(tree.export_text(model2, feature_names=feature_names, show_weights=True))
|--- Income <= 92.50
|   |--- CCAvg <= 2.95
|   |   |--- weights: [1344.67, 0.00] class: 0
|   |--- CCAvg > 2.95
|   |   |--- CD_Account <= 0.50
|   |   |   |--- weights: [64.61, 52.87] class: 0
|   |   |--- CD_Account > 0.50
|   |   |   |--- weights: [0.00, 26.44] class: 1
|--- Income > 92.50
|   |--- Family <= 2.50
|   |   |--- Education_3 <= 0.50
|   |   |   |--- weights: [280.53, 322.51] class: 1
|   |   |--- Education_3 > 0.50
|   |   |   |--- weights: [17.67, 375.38] class: 1
|   |--- Family > 2.50
|   |   |--- Income <= 113.50
|   |   |   |--- weights: [39.21, 126.89] class: 1
|   |   |--- Income > 113.50
|   |   |   |--- weights: [3.31, 845.92] class: 1
importances = model2.feature_importances_
importances
array([0. , 0.81206256, 0.05162516, 0.06113121, 0. ,
0. , 0.01010733, 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0.06507374])
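The raw importance array is easier to read when paired with the column names. A standalone sketch (the toy data below stands in for the notebook's `X_train` and `model2`):

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in for the notebook's X_train / model2
X = pd.DataFrame({"Income": [40, 60, 120, 150], "Age": [30, 45, 35, 50]})
y = [0, 0, 1, 1]
tree_model = DecisionTreeClassifier(random_state=1).fit(X, y)

# Pair importances with feature names, drop zeros, sort largest first
imp = pd.Series(tree_model.feature_importances_, index=X.columns)
print(imp[imp > 0].sort_values(ascending=False))
```

Applied to `model2`, this would show at a glance that Income dominates, with CCAvg, Family, and Education contributing the remainder.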
# importance of features in the tree building
importances = model2.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
clf = DecisionTreeClassifier(random_state=1, class_weight="balanced")
path = clf.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = abs(path.ccp_alphas), path.impurities  # abs() guards against tiny negative alphas from floating-point error
pd.DataFrame(path)
| | ccp_alphas | impurities |
|---|---|---|
| 0 | 0.000000e+00 | -7.759588e-16 |
| 1 | 3.853725e-19 | -7.755734e-16 |
| 2 | 4.729571e-19 | -7.751004e-16 |
| 3 | 5.255079e-19 | -7.745749e-16 |
| 4 | 5.255079e-19 | -7.740494e-16 |
| 5 | 7.707449e-19 | -7.732787e-16 |
| 6 | 1.051016e-18 | -7.722277e-16 |
| 7 | 1.576524e-18 | -7.706511e-16 |
| 8 | 1.257806e-17 | -7.580731e-16 |
| 9 | 1.524700e-04 | 3.049400e-04 |
| 10 | 1.567552e-04 | 6.184504e-04 |
| 11 | 2.761014e-04 | 1.170653e-03 |
| 12 | 2.857143e-04 | 1.456367e-03 |
| 13 | 3.036629e-04 | 2.367356e-03 |
| 14 | 3.090999e-04 | 2.676456e-03 |
| 15 | 4.498426e-04 | 3.576141e-03 |
| 16 | 4.663329e-04 | 4.975140e-03 |
| 17 | 4.714887e-04 | 6.861095e-03 |
| 18 | 5.517639e-04 | 7.412859e-03 |
| 19 | 5.900285e-04 | 8.002887e-03 |
| 20 | 6.292497e-04 | 9.261386e-03 |
| 21 | 6.616439e-04 | 1.058467e-02 |
| 22 | 7.538167e-04 | 1.133849e-02 |
| 23 | 7.580649e-04 | 1.285462e-02 |
| 24 | 8.728080e-04 | 1.634585e-02 |
| 25 | 8.981804e-04 | 1.724403e-02 |
| 26 | 9.269294e-04 | 1.817096e-02 |
| 27 | 1.461815e-03 | 1.963278e-02 |
| 28 | 1.768018e-03 | 2.140080e-02 |
| 29 | 1.981730e-03 | 2.536426e-02 |
| 30 | 2.150414e-03 | 2.751467e-02 |
| 31 | 2.375809e-03 | 2.989048e-02 |
| 32 | 2.472660e-03 | 3.483580e-02 |
| 33 | 3.297255e-03 | 3.813305e-02 |
| 34 | 3.344493e-03 | 4.147755e-02 |
| 35 | 3.503794e-03 | 4.498134e-02 |
| 36 | 3.602932e-03 | 5.218720e-02 |
| 37 | 3.729690e-03 | 5.591689e-02 |
| 38 | 4.941457e-03 | 6.085835e-02 |
| 39 | 4.970987e-03 | 7.080032e-02 |
| 40 | 2.255792e-02 | 9.335825e-02 |
| 41 | 3.708749e-02 | 2.046207e-01 |
| 42 | 2.953793e-01 | 5.000000e-01 |
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(ccp_alphas[:-1], impurities[:-1], marker="o", drawstyle="steps-post")
ax.set_xlabel("effective alpha")
ax.set_ylabel("total impurity of leaves")
ax.set_title("Total Impurity vs effective alpha for training set")
plt.show()
Next, we train a decision tree using the effective alphas. The last value in ccp_alphas is the alpha value that prunes the whole tree, leaving the tree, clfs[-1], with one node.
clfs = []
for ccp_alpha in ccp_alphas:
    clf = DecisionTreeClassifier(
        random_state=1, ccp_alpha=ccp_alpha, class_weight="balanced"
    )
    clf.fit(X_train, y_train)
    clfs.append(clf)
print(
    "Number of nodes in the last tree is: {} with ccp_alpha: {}".format(
        clfs[-1].tree_.node_count, ccp_alphas[-1]
    )
)
Number of nodes in the last tree is: 1 with ccp_alpha: 0.2953792759992314
For the remainder, we remove the last element in clfs and ccp_alphas, because it is the trivial tree with only one node. Here we show that the number of nodes and tree depth decreases as alpha increases.
clfs = clfs[:-1]
ccp_alphas = ccp_alphas[:-1]
node_counts = [clf.tree_.node_count for clf in clfs]
depth = [clf.tree_.max_depth for clf in clfs]
fig, ax = plt.subplots(2, 1, figsize=(10, 7))
ax[0].plot(ccp_alphas, node_counts, marker="o", drawstyle="steps-post")
ax[0].set_xlabel("alpha")
ax[0].set_ylabel("number of nodes")
ax[0].set_title("Number of nodes vs alpha")
ax[1].plot(ccp_alphas, depth, marker="o", drawstyle="steps-post")
ax[1].set_xlabel("alpha")
ax[1].set_ylabel("depth of tree")
ax[1].set_title("Depth vs alpha")
fig.tight_layout()
recall_train = []
for clf in clfs:
    pred_train = clf.predict(X_train)
    values_train = recall_score(y_train, pred_train)
    recall_train.append(values_train)

recall_test = []
for clf in clfs:
    pred_test = clf.predict(X_test)
    values_test = recall_score(y_test, pred_test)
    recall_test.append(values_test)
train_scores = [clf.score(X_train, y_train) for clf in clfs]
test_scores = [clf.score(X_test, y_test) for clf in clfs]
fig, ax = plt.subplots(figsize=(15, 5))
ax.set_xlabel("alpha")
ax.set_ylabel("Recall")
ax.set_title("Recall vs alpha for training and testing sets")
ax.plot(
    ccp_alphas, recall_train, marker="o", label="train", drawstyle="steps-post",
)
ax.plot(ccp_alphas, recall_test, marker="o", label="test", drawstyle="steps-post")
ax.legend()
plt.show()
# creating the model where we get highest train and test recall
index_best_model = np.argmax(recall_test)
best_model = clfs[index_best_model]
print(best_model)
DecisionTreeClassifier(ccp_alpha=0.0024726598786422157, class_weight='balanced',
                       random_state=1)
model3 = best_model
confusion_matrix_sklearn(model3, X_train, y_train)
model3_post_perf_train = model_performance_classification_sklearn(
    model3, X_train, y_train
)
model3_post_perf_train
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.956857 | 1.0 | 0.686722 | 0.814268 |
confusion_matrix_sklearn(model3, X_test, y_test)
model3_post_perf_test = model_performance_classification_sklearn(
    model3, X_test, y_test
)
model3_post_perf_test
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.948667 | 0.993289 | 0.660714 | 0.793566 |
The post-pruned tree also gives a generalized result: recall is close to 1 on both the train (1.0) and test (0.993) data, showing that the model performs consistently on unseen data.
plt.figure(figsize=(20, 10))
out = tree.plot_tree(
    model3,
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=False,
    class_names=True,
)
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()
# Text report showing the rules of a decision tree -
print(tree.export_text(model3, feature_names=feature_names, show_weights=True))
|--- Income <= 92.50
|   |--- CCAvg <= 2.95
|   |   |--- weights: [1344.67, 0.00] class: 0
|   |--- CCAvg > 2.95
|   |   |--- CD_Account <= 0.50
|   |   |   |--- CCAvg <= 3.95
|   |   |   |   |--- weights: [41.42, 52.87] class: 1
|   |   |   |--- CCAvg > 3.95
|   |   |   |   |--- weights: [23.19, 0.00] class: 0
|   |   |--- CD_Account > 0.50
|   |   |   |--- weights: [0.00, 26.44] class: 1
|--- Income > 92.50
|   |--- Family <= 2.50
|   |   |--- Education_3 <= 0.50
|   |   |   |--- Education_2 <= 0.50
|   |   |   |   |--- Income <= 103.50
|   |   |   |   |   |--- CCAvg <= 3.21
|   |   |   |   |   |   |--- weights: [22.09, 0.00] class: 0
|   |   |   |   |   |--- CCAvg > 3.21
|   |   |   |   |   |   |--- weights: [2.76, 15.86] class: 1
|   |   |   |   |--- Income > 103.50
|   |   |   |   |   |--- weights: [239.11, 0.00] class: 0
|   |   |   |--- Education_2 > 0.50
|   |   |   |   |--- Income <= 110.00
|   |   |   |   |   |--- CCAvg <= 2.90
|   |   |   |   |   |   |--- weights: [12.70, 0.00] class: 0
|   |   |   |   |   |--- CCAvg > 2.90
|   |   |   |   |   |   |--- weights: [0.00, 10.57] class: 1
|   |   |   |   |--- Income > 110.00
|   |   |   |   |   |--- weights: [3.87, 296.07] class: 1
|   |   |--- Education_3 > 0.50
|   |   |   |--- weights: [17.67, 375.38] class: 1
|   |--- Family > 2.50
|   |   |--- Income <= 113.50
|   |   |   |--- CCAvg <= 2.80
|   |   |   |   |--- Income <= 106.50
|   |   |   |   |   |--- weights: [24.85, 0.00] class: 0
|   |   |   |   |--- Income > 106.50
|   |   |   |   |   |--- weights: [11.04, 31.72] class: 1
|   |   |   |--- CCAvg > 2.80
|   |   |   |   |--- weights: [3.31, 95.17] class: 1
|   |   |--- Income > 113.50
|   |   |   |--- weights: [3.31, 845.92] class: 1
importances = model3.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
# training performance comparison
models_train_comp_df = pd.concat(
    [
        default_decision_tree_perf_train.T,
        model1_perf_train.T,
        model2_pre_perf_train.T,
        model3_post_perf_train.T,
    ],
    axis=1,
)
models_train_comp_df.columns = [
    "Decision Tree (sklearn default)",
    "Decision Tree with class_weight",
    "Decision Tree (Pre-Pruning)",
    "Decision Tree (Post-Pruning)",
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
| | Decision Tree (sklearn default) | Decision Tree with class_weight | Decision Tree (Pre-Pruning) | Decision Tree (Post-Pruning) |
|---|---|---|---|---|
| Accuracy | 1.0 | 1.0 | 0.820857 | 0.956857 |
| Recall | 1.0 | 1.0 | 0.969789 | 1.000000 |
| Precision | 1.0 | 1.0 | 0.342217 | 0.686722 |
| F1 | 1.0 | 1.0 | 0.505910 | 0.814268 |
# testing performance comparison
models_test_comp_df = pd.concat(
    [
        default_decision_tree_perf_test.T,
        model1_perf_test.T,
        model2_pre_perf_test.T,
        model3_post_perf_test.T,
    ],
    axis=1,
)
models_test_comp_df.columns = [
    "Decision Tree (sklearn default)",
    "Decision Tree with class_weight",
    "Decision Tree (Pre-Pruning)",
    "Decision Tree (Post-Pruning)",
]
print("Test set performance comparison:")
models_test_comp_df
Test set performance comparison:
| | Decision Tree (sklearn default) | Decision Tree with class_weight | Decision Tree (Pre-Pruning) | Decision Tree (Post-Pruning) |
|---|---|---|---|---|
| Accuracy | 0.986000 | 0.978667 | 0.801333 | 0.948667 |
| Recall | 0.932886 | 0.865772 | 0.953020 | 0.993289 |
| Precision | 0.926667 | 0.914894 | 0.327945 | 0.660714 |
| F1 | 0.929766 | 0.889655 | 0.487973 | 0.793566 |
!jupyter nbconvert --to html '/content/drive/MyDrive/Colab Notebooks/AIMLCourse/Machine_Learning/Project02/AIML_ML_Project_full_code_notebook.ipynb' --output-dir '/content/drive/MyDrive/Colab Notebooks/AIMLCourse/Machine_Learning/Project02'
[NbConvertApp] Converting notebook /content/drive/MyDrive/Colab Notebooks/AIMLCourse/Machine_Learning/Project02/AIML_ML_Project_full_code_notebook.ipynb to html
[NbConvertApp] Writing 5284135 bytes to /content/drive/MyDrive/Colab Notebooks/AIMLCourse/Machine_Learning/Project02/AIML_ML_Project_full_code_notebook.html